In this tutorial, you will learn one of the important step of a Data Science pipeline i.e. Exploratory Data Analysis.
Exploratory Data Analysis (EDA): In this step, we try to understand the data and the underlying interactions between different variables.
The dataset we will be using here is the wine quality data.
Data Set Information: The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Quality is based on sensory scores (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
All Python capabilities are not loaded to our working environment by default (even they are already installed in your system). So, we import each and every library that we want to use.
In data science, numpy and pandas are most commonly used libraries. Numpy is required for calculations like means, medians, square roots, etc. Pandas is used for data processin and data frames. We chose alias names for our libraries for the sake of our convenience (numpy --> np and pandas --> pd).
import numpy as np # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd # Data analysis and manipultion tool
Pandas module is used for reading files. We have our data in '.csv' format. We will use 'read_csv()' function for loading the data.
# In read_csv() function, we have passed the location to where the files are located in the UCI website. The data is separated by ';'
# so we used separator as ';' (sep = ";")
red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")
Let's check how our data looks. This can be done using head() method.
# Red Wine
red.head() # we can also pass the number of records we want in the brackets (). By default it displays first 5 records.
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
# White Wine
white.head(6) # we will get first 6 records from white wine data
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 5 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
Le's explore the attributes / columns of the datasets. Both red and white wines have the same columns.
# Columns / Attribute
red.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
'pH', 'sulphates', 'alcohol', 'quality'],
dtype='object')
The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. It is the variable that is, or should be the output.
Here quality is the target variable as we're trying to know which of the two types of wines have a better quality.
One or more variables that are used to determine (or predict) the 'Target Variable' are known as Input Variables. They are sometimes called Predictor Variable as well.
In our example, the input variables are: 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', and 'alcohol'.
All of these will help us predict the quality of the wine.
Variables and features both are the same, they are often used interchangeably. All the column names in a dataset are variables.
After loading the data, it is important to examine the data. It is usually not recommended to directly throw all the data into the model without understanding the data. This step always helps in improving our model.
Concatenate both the data.
# Add a column to separate whether the wine is red or white.
red['color'] = 'r'
white['color'] = 'w'
wine = pd.concat([red, white])
Let's rename the columns which contain spaces in their names.
# rename() function is used to rename the columns
# wine data
# red_wine_data
wine.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
inplace = True)
# red_wine_data
red.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
inplace = True) # inplace = True makes changes in the dataframe itself
# white_wine_data
white.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
inplace = True)
red.head(2)
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | r |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | r |
# concise summary about dataset
red.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1599 entries, 0 to 1598 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed_acidity 1599 non-null float64 1 volatile_acidity 1599 non-null float64 2 citric_acid 1599 non-null float64 3 residual_sugar 1599 non-null float64 4 chlorides 1599 non-null float64 5 free_sulfur_dioxide 1599 non-null float64 6 total_sulfur_dioxide 1599 non-null float64 7 density 1599 non-null float64 8 pH 1599 non-null float64 9 sulphates 1599 non-null float64 10 alcohol 1599 non-null float64 11 quality 1599 non-null int64 12 color 1599 non-null object dtypes: float64(11), int64(1), object(1) memory usage: 162.5+ KB
white.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4898 entries, 0 to 4897 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed_acidity 4898 non-null float64 1 volatile_acidity 4898 non-null float64 2 citric_acid 4898 non-null float64 3 residual_sugar 4898 non-null float64 4 chlorides 4898 non-null float64 5 free_sulfur_dioxide 4898 non-null float64 6 total_sulfur_dioxide 4898 non-null float64 7 density 4898 non-null float64 8 pH 4898 non-null float64 9 sulphates 4898 non-null float64 10 alcohol 4898 non-null float64 11 quality 4898 non-null int64 12 color 4898 non-null object dtypes: float64(11), int64(1), object(1) memory usage: 497.6+ KB
It is always interesting to know the basic statistical characteristics of each numerical variables.
# Basic Statistical details
red.describe()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
white.describe()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.121020 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
| 25% | 6.300000 | 0.210000 | 0.270000 | 1.700000 | 0.036000 | 23.000000 | 108.000000 | 0.991723 | 3.090000 | 0.410000 | 9.500000 | 5.000000 |
| 50% | 6.800000 | 0.260000 | 0.320000 | 5.200000 | 0.043000 | 34.000000 | 134.000000 | 0.993740 | 3.180000 | 0.470000 | 10.400000 | 6.000000 |
| 75% | 7.300000 | 0.320000 | 0.390000 | 9.900000 | 0.050000 | 46.000000 | 167.000000 | 0.996100 | 3.280000 | 0.550000 | 11.400000 | 6.000000 |
| max | 14.200000 | 1.100000 | 1.660000 | 65.800000 | 0.346000 | 289.000000 | 440.000000 | 1.038980 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |
Let's explore different statistical measures that we have got from desribe().
Note: 25%, 50%, and 75% are nothing but corresponding percentile values
Our brains are good at spotting patterns in pictures. Let's play around different types of data visualizations.
# first import data visualizations libraries
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
# red wines
red.hist(bins=10, figsize=(16,12), grid = False)
plt.show()
The distribution of the attribute seems to be positively skewed. The attributes 'density' and 'pH' are quite normally distributed (you must have read about normal distribution in basic statistics study material). Now looking at the attribute quality, we can observe that the wines with average quality (i.e. quality rating 5 to 7) are more than bad and good quality of wines.
# white wine
white.hist(bins=10, figsize=(16,12), grid = False)
plt.show()
For white wines, the attribute 'pH' is quite normally distributed. The average quality of wines are more than good and bad qualities of wines for white wine data too. Most of the wines seem to be containing alcohol percentage in the range 8.5% to nearly 13%.
We can use pivot tables to observe the values of different features for each quality of wines.
# Creating pivot table for red wine
columns = list(red.columns).remove('quality')
red.pivot_table(columns, ['quality'], aggfunc=np.median) # By default the aggfunc is mean
| alcohol | chlorides | citric_acid | density | fixed_acidity | free_sulfur_dioxide | pH | residual_sugar | sulphates | total_sulfur_dioxide | volatile_acidity | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| quality | |||||||||||
| 3 | 9.925 | 0.0905 | 0.035 | 0.997565 | 7.50 | 6.0 | 3.39 | 2.1 | 0.545 | 15.0 | 0.845 |
| 4 | 10.000 | 0.0800 | 0.090 | 0.996500 | 7.50 | 11.0 | 3.37 | 2.1 | 0.560 | 26.0 | 0.670 |
| 5 | 9.700 | 0.0810 | 0.230 | 0.997000 | 7.80 | 15.0 | 3.30 | 2.2 | 0.580 | 47.0 | 0.580 |
| 6 | 10.500 | 0.0780 | 0.260 | 0.996560 | 7.90 | 14.0 | 3.32 | 2.2 | 0.640 | 35.0 | 0.490 |
| 7 | 11.500 | 0.0730 | 0.400 | 0.995770 | 8.80 | 11.0 | 3.28 | 2.3 | 0.740 | 27.0 | 0.370 |
| 8 | 12.150 | 0.0705 | 0.420 | 0.994940 | 8.25 | 7.5 | 3.23 | 2.1 | 0.740 | 21.5 | 0.370 |
For each quality of red wines, we can observe the median values of different features.
# Creating pivot table for red wine
columns = list(white.columns).remove('quality')
white.pivot_table(columns, ['quality'], aggfunc=np.median)
| alcohol | chlorides | citric_acid | density | fixed_acidity | free_sulfur_dioxide | pH | residual_sugar | sulphates | total_sulfur_dioxide | volatile_acidity | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| quality | |||||||||||
| 3 | 10.45 | 0.041 | 0.345 | 0.994425 | 7.3 | 33.5 | 3.215 | 4.60 | 0.44 | 159.5 | 0.26 |
| 4 | 10.10 | 0.046 | 0.290 | 0.994100 | 6.9 | 18.0 | 3.160 | 2.50 | 0.47 | 117.0 | 0.32 |
| 5 | 9.50 | 0.047 | 0.320 | 0.995300 | 6.8 | 35.0 | 3.160 | 7.00 | 0.47 | 151.0 | 0.28 |
| 6 | 10.50 | 0.043 | 0.320 | 0.993660 | 6.8 | 34.0 | 3.180 | 5.30 | 0.48 | 132.0 | 0.25 |
| 7 | 11.40 | 0.037 | 0.310 | 0.991760 | 6.7 | 33.0 | 3.200 | 3.65 | 0.48 | 122.0 | 0.25 |
| 8 | 12.00 | 0.036 | 0.320 | 0.991640 | 6.8 | 35.0 | 3.230 | 4.30 | 0.46 | 122.0 | 0.26 |
| 9 | 12.50 | 0.031 | 0.360 | 0.990300 | 7.1 | 28.0 | 3.280 | 2.20 | 0.46 | 119.0 | 0.27 |
We can check how each features are reated with others using corr() function.
The correlation value ranges between -1 to 1.When it is close to 1, it means that there is a strong positive correlation. When the coefficient is close to –1, it means that there is a strong negative correlation. Finally, coefficients close to zero mean that there is no linear correlation. We can observe the detail information using correlation matrix
# red wines
red.corr()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 1.000000 | -0.256131 | 0.671703 | 0.114777 | 0.093705 | -0.153794 | -0.113181 | 0.668047 | -0.682978 | 0.183006 | -0.061668 | 0.124052 |
| volatile_acidity | -0.256131 | 1.000000 | -0.552496 | 0.001918 | 0.061298 | -0.010504 | 0.076470 | 0.022026 | 0.234937 | -0.260987 | -0.202288 | -0.390558 |
| citric_acid | 0.671703 | -0.552496 | 1.000000 | 0.143577 | 0.203823 | -0.060978 | 0.035533 | 0.364947 | -0.541904 | 0.312770 | 0.109903 | 0.226373 |
| residual_sugar | 0.114777 | 0.001918 | 0.143577 | 1.000000 | 0.055610 | 0.187049 | 0.203028 | 0.355283 | -0.085652 | 0.005527 | 0.042075 | 0.013732 |
| chlorides | 0.093705 | 0.061298 | 0.203823 | 0.055610 | 1.000000 | 0.005562 | 0.047400 | 0.200632 | -0.265026 | 0.371260 | -0.221141 | -0.128907 |
| free_sulfur_dioxide | -0.153794 | -0.010504 | -0.060978 | 0.187049 | 0.005562 | 1.000000 | 0.667666 | -0.021946 | 0.070377 | 0.051658 | -0.069408 | -0.050656 |
| total_sulfur_dioxide | -0.113181 | 0.076470 | 0.035533 | 0.203028 | 0.047400 | 0.667666 | 1.000000 | 0.071269 | -0.066495 | 0.042947 | -0.205654 | -0.185100 |
| density | 0.668047 | 0.022026 | 0.364947 | 0.355283 | 0.200632 | -0.021946 | 0.071269 | 1.000000 | -0.341699 | 0.148506 | -0.496180 | -0.174919 |
| pH | -0.682978 | 0.234937 | -0.541904 | -0.085652 | -0.265026 | 0.070377 | -0.066495 | -0.341699 | 1.000000 | -0.196648 | 0.205633 | -0.057731 |
| sulphates | 0.183006 | -0.260987 | 0.312770 | 0.005527 | 0.371260 | 0.051658 | 0.042947 | 0.148506 | -0.196648 | 1.000000 | 0.093595 | 0.251397 |
| alcohol | -0.061668 | -0.202288 | 0.109903 | 0.042075 | -0.221141 | -0.069408 | -0.205654 | -0.496180 | 0.205633 | 0.093595 | 1.000000 | 0.476166 |
| quality | 0.124052 | -0.390558 | 0.226373 | 0.013732 | -0.128907 | -0.050656 | -0.185100 | -0.174919 | -0.057731 | 0.251397 | 0.476166 | 1.000000 |
From the above correlation matrix, we can observe that there is a relatively high positive correlation between fixed_acidity and citric_acid, fixed_acidity and density. Similarly we can observe there is a relatively high negative correlation between fixed_acidity and pH. There is relatively high positive correlation between alcohol presence and quality of the wines.
# white wines
white.corr()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 1.000000 | -0.022697 | 0.289181 | 0.089021 | 0.023086 | -0.049396 | 0.091070 | 0.265331 | -0.425858 | -0.017143 | -0.120881 | -0.113663 |
| volatile_acidity | -0.022697 | 1.000000 | -0.149472 | 0.064286 | 0.070512 | -0.097012 | 0.089261 | 0.027114 | -0.031915 | -0.035728 | 0.067718 | -0.194723 |
| citric_acid | 0.289181 | -0.149472 | 1.000000 | 0.094212 | 0.114364 | 0.094077 | 0.121131 | 0.149503 | -0.163748 | 0.062331 | -0.075729 | -0.009209 |
| residual_sugar | 0.089021 | 0.064286 | 0.094212 | 1.000000 | 0.088685 | 0.299098 | 0.401439 | 0.838966 | -0.194133 | -0.026664 | -0.450631 | -0.097577 |
| chlorides | 0.023086 | 0.070512 | 0.114364 | 0.088685 | 1.000000 | 0.101392 | 0.198910 | 0.257211 | -0.090439 | 0.016763 | -0.360189 | -0.209934 |
| free_sulfur_dioxide | -0.049396 | -0.097012 | 0.094077 | 0.299098 | 0.101392 | 1.000000 | 0.615501 | 0.294210 | -0.000618 | 0.059217 | -0.250104 | 0.008158 |
| total_sulfur_dioxide | 0.091070 | 0.089261 | 0.121131 | 0.401439 | 0.198910 | 0.615501 | 1.000000 | 0.529881 | 0.002321 | 0.134562 | -0.448892 | -0.174737 |
| density | 0.265331 | 0.027114 | 0.149503 | 0.838966 | 0.257211 | 0.294210 | 0.529881 | 1.000000 | -0.093591 | 0.074493 | -0.780138 | -0.307123 |
| pH | -0.425858 | -0.031915 | -0.163748 | -0.194133 | -0.090439 | -0.000618 | 0.002321 | -0.093591 | 1.000000 | 0.155951 | 0.121432 | 0.099427 |
| sulphates | -0.017143 | -0.035728 | 0.062331 | -0.026664 | 0.016763 | 0.059217 | 0.134562 | 0.074493 | 0.155951 | 1.000000 | -0.017433 | 0.053678 |
| alcohol | -0.120881 | 0.067718 | -0.075729 | -0.450631 | -0.360189 | -0.250104 | -0.448892 | -0.780138 | 0.121432 | -0.017433 | 1.000000 | 0.435575 |
| quality | -0.113663 | -0.194723 | -0.009209 | -0.097577 | -0.209934 | 0.008158 | -0.174737 | -0.307123 | 0.099427 | 0.053678 | 0.435575 | 1.000000 |
We can plot the above correlation matrix using heatmaps too. The visualization using heatmap is a pictorial visualization.
# Custom colour map based on Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#221f1f', '#0045a4','#0e4f66'])
def Heatmap(df):
corr = df.corr()
mask = np.zeros_like(corr, dtype= bool)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(10, 7), dpi = 100)
fig.text(.35,.93,'Correlation Coefficient', fontfamily='serif',fontweight='bold',fontsize=15)
pl = sns.heatmap(corr, mask = mask, annot = True, fmt = '.2f',cbar = True,
cmap=cmap, vmax=.3, vmin=-.3, center=0, square=True, linewidths=2.5)
plt.tick_params(which=u'both', axis=u'both', length=0)
plt.tight_layout()
# Show the plot
plt.show()
# Heat map of red wine
Heatmap(red)
Heatmap(white)
def fig():
rc = {'figure.figsize':(10,6),
'figure.dpi': 80,
'axes.facecolor':'white',
'axes.grid' : False,
'font.family':'serif',
'font.size' : 12}
plt.rcParams.update(rc)
# create a function to label axes and set title
def label(xaxis, yaxis, title):
"""This function set the label axes and title of plots"""
plt.xlabel(xaxis)
plt.ylabel(yaxis)
plt.title(title, fontweight = 'bold', fontsize = 15)
# Fuction to remove borders and clean the axes
def remove_borders():
g = sns.despine(top =True, left=True, right=True, bottom=True)
plt.tick_params(axis = u'both', which=u'both', length=0)
The attribute quality is categorical in nature and we can visualize this type of attributes using barplot or countplot.
# Countplot for quality of wines present in different category of wines (red and white)
fig()
sns.countplot(x = wine.quality, hue=wine.color,
width = 0.6)
# Add title and label axis
label('Quality', 'Count', 'Quality of wine')
# Remove spines
remove_borders()
# Show the plot
plt.show()
From above countplot, we can observe that the average quality of wines are more than good and bad quality of wines in both variants of wines.
We can visualize scatterplot matrix for the better understanding relationship between a pair of variables. It plots every numerical attribute against every other. 'pairplot' of seaborn helps to achieve this
# red wine
sns.pairplot(red)
plt.show()
The correlation between fixed_acidity and citric_acid is 0.67 (you could find this value in the correlation matrix of red wines). Looking for scatterplot for this pair of variables, we can see the positive linear correlation between these two variables. We can observe the upward trend, and also the points are not too dispersed.
Similarly we can plot scatter plot between variables for white wines.
# Pairplot
sns.pairplot(white)
plt.show()